# Visual Instruction Fine-tuning

Mistral Small 3.1 24B Instruct 2503 GGUF

A vision-enabled GGUF build based on Mistral-Small-3.1-24B-Instruct-2503, supporting image-to-text generation.

General Reasoner 14B Preview

A multimodal reasoning model trained from the Qwen2.5-14B base model on the VisualWebInstruct-Verified dataset, supporting English-language tasks.

Large Language Model

Transformers English

Qwen2.5 VL 32B Instruct GGUF

Qwen2.5-VL-32B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks for both images and text.

Image-to-Text English

Llama 3.2 Vision Instruct Bpmncoder

A Llama 3.2 11B vision instruction fine-tuned model trained with Unsloth using 4-bit quantization, with a reported 2x training speedup.

Transformers English

Qwen2.5 VL 72B Instruct GGUF

Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.

Image-to-Text English

Llama 3.2 11B Vision Medical

A model fine-tuned from unsloth/Llama-3.2-11B-Vision-Instruct using Unsloth and Hugging Face's TRL library, with a reported 2x training speedup.

Transformers English

Llama 3.2 11B Vision Invoices Mini

A multimodal large language model fine-tuned from unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit for visual instruction understanding tasks, trained with Unsloth for a reported 2x training speedup.

Transformers English

Llama 3.2 11B Vision Radiology Mini

A Llama 3.2 11B vision instruction fine-tuned model trained with Unsloth, supporting multimodal tasks.

Transformers English

Vsft Llava 1.5 7b Hf Trl

A multimodal vision-language model produced by Visual Supervised Fine-Tuning (VSFT) of LLaVA-1.5-7B, supporting image understanding and dialogue generation.

Transformers English

Llava V1.5 Mlp2x 336px Pretrain Vicuna 13b V1.5

LLaVA is an open-source multimodal chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

Llava Pretrain Vicuna 7b V1.3

LLaVA is an open-source multimodal chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

Chinese LLaVA Cllama2

An open-source vision-language assistant, licensed for commercial use, that supports multimodal dialogue in both Chinese and English.

Transformers Supports Multiple Languages

Instructblip Flan T5 Xl

InstructBLIP is the visual instruction fine-tuned version of BLIP-2, capable of vision-language tasks such as image captioning and visual question answering.

Transformers English

© 2025 AIbase